Abstract


In this paper we relate the partition function 
to the max-statistics of random variables. 
In particular, we provide a novel framework 
for approximating and bounding the parti- 
tion function using MAP inference on ran- 
domly perturbed models. As a result, we can 
use efficient MAP solvers such as graph-cuts 
to evaluate the corresponding partition func- 
tion. We show that our method excels in the 
typical “high signal - high coupling” regime 
that results in ragged energy landscapes dif- 
ficult for alternative approaches. 


1. Introduction 


Learning and inference in complex models drive much
of the research in machine learning applications, from
computer vision and natural language processing to
computational biology. Examples include object detection
(Felzenszwalb et al., 2009), stereo vision (Szeliski
et al., 2007), parsing (Koo et al., 2010), and protein
design (Sontag et al., 2008). The inference problem
in such cases involves assessing the likelihood of
possible structures, whether objects, parses, or
molecular structures. The structures are specified by
assignments of random variables that need to be maximized
or summed over. However, it is often feasible to only 
find the most likely or maximum a-posteriori (MAP) 
assignment rather than considering all possible assign- 
ments. Indeed, substantial effort has gone into devel- 
oping algorithms for recovering MAP assignments, ei- 
ther based on specific structural restrictions such as 
super-modularity (Kolmogorov, 2006) or by devising 
approximate methods based on linear programming re- 
laxations (Sontag et al., 2008; Werner, 2008). 


MAP inference is limited when there are other likely 
assignments. For example, in pose estimation the recovery
of the 3D joint positions from 2D images is often
inherently ambiguous. Similarly, in parsing there
might be equally likely parse trees for the same sen- 
tence. In a fully probabilistic treatment, all possible 
alternative assignments are considered. This requires 
summing over the assignments with their respective 
weights — evaluating the partition function — which is 
considerably harder. In fact, any algorithm for com- 
puting the partition function would be able to approx- 
imate the MAP value arbitrarily well via a tempera- 
ture argument (Landau & Lifshitz, 1980). In contrast, 
MAP inference (maximization) can be tractable even 
when the problem of evaluating the partition function 
(weighted counting) is not. 
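
The temperature argument can be made explicit with a standard bound, sketched here for a finite assignment space $Y$. The temperature symbol $t$ and the notation $Z_t$ are introduced only for this illustration; $\phi(y)$ denotes the potential of an assignment $y$, as defined in Section 2.

```latex
Z_t = \sum_{y \in Y} \exp\big(\phi(y)/t\big), \qquad
\max_{y} \phi(y) \;\le\; t \log Z_t \;\le\; \max_{y} \phi(y) + t \log |Y| .
```

As $t \to 0$ the two sides coincide, so $t \log Z_t$ recovers the MAP value to any desired accuracy.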


The main surprising result of our work is that MAP in- 
ference can be used to approximate the partition func- 
tion. In other words, given an algorithm for computing 
the MAP value, we can approximate or bound the par- 
tition function. The MAP values in our case arise from 
randomly perturbed models. While models based on 
random perturbations have been considered recently 
(Papandreou & Yuille, 2011; Keshet et al., 2011; Tar- 
low et al., 2012), their relation to the partition function 
has not. Specifically, we relate the partition function 
to the expected MAP value of perturbations. This 
result enables us for the first time to directly use ef- 
ficient MAP solvers such as graph-cuts or MPLP in 
calculating the partition function. The approach ex- 
cels in regimes where there are several but not expo- 
nentially many prominent assignments. For example, 
this happens in cases where observations carry strong 
signals (local evidence) but are also guided by strong 
consistency constraints (couplings). 


We begin by introducing the notation and the count- 
ing and maximization problems of interest. We sub- 
sequently relate the partition function to the max- 
statistics of random variables, and introduce new ap- 
proximations and bounds on the partition function 
based on random MAP perturbations. Finally, we de- 
scribe how to use this method in the context of condi- 
tional random fields and demonstrate the effectiveness 
of the approach. 


Random MAP perturbations 


2. Background 


Here we briefly define the counting and maximization 
problems of interest. Throughout the paper we assume
real valued potentials $\phi(y) = \phi(y_1, \dots, y_n) < \infty$ defined
over a discrete product space $Y = Y_1 \times \cdots \times Y_n$. The
domain is implicitly defined through $\phi(y)$ via
exclusions $\phi(y) = -\infty$ whenever $y \notin \mathrm{dom}(\phi)$.


The Gibbs distribution maps the real valued potential 
functions to the probability scale 


$$p(y_1, \dots, y_n) = \frac{1}{Z} \exp(\phi(y_1, \dots, y_n)) \tag{1}$$
$$Z = \sum_{y_1, \dots, y_n} \exp(\phi(y_1, \dots, y_n)), \tag{2}$$


where the normalization constant Z is also known as 
the partition function. The feasibility of using such 
a distribution for inference and learning is inherently 
tied to the ability to evaluate the partition function. In 
the special case where $\phi(y) \in \{-\infty, 0\}$, the partition
function reduces to counting the number of allowed
configurations, namely $|\mathrm{dom}(\phi)|$. In general, counting
problems are considered very hard, many belonging to 
the complexity class #P (Valiant, 1979). 


We can also express the maximum a-posteriori (MAP) 
inference problem in the same notation as 


$$\text{(MAP)} \qquad \max_{y_1, \dots, y_n} \phi(y_1, \dots, y_n), \tag{3}$$


where, again, the domain of $\phi(y)$ is implicitly
considered since the maximization avoids all configurations
outside the domain. Although the MAP problem is
NP-hard in general (Shimony, 1994), it is easier than
computing the partition function. It can be solved
efficiently in many cases of practical interest, e.g. when
$\phi(y)$ is a super-modular function. A number of
algorithms based on linear programming relaxations have
recently been developed for solving MAP problems. 
Although the run-time of these solvers can be expo- 
nential, they are often surprisingly effective in prac- 
tice. 


3. Max-Statistics 


In the following we describe the basis of our framework.
We show how to realize the partition function
as the expected value of random MAP perturbations.
Analytic expressions for the statistics of a
random MAP perturbation can be derived for general
discrete sets, whenever independent and identically
distributed random perturbations are applied for
every assignment $y \in Y$. Let $\{\gamma(y)\}_{y \in Y}$ be a
collection of random variables. Assume these random
variables are independent and identically distributed
with $F(t)$ as their cumulative distribution function,
i.e., $F(t) = P[\gamma(y) \le t]$ for each $y \in Y$. The
independence of $\gamma(y)$ across $y \in Y$ implies that the
cumulative distribution function of the random MAP
perturbation $\max_{y \in Y}\{\phi(y) + \gamma(y)\}$ is the product of
the cumulative distribution functions of the individual
perturbations $\phi(y) + \gamma(y)$. Therefore
$P[\max_{y \in Y}\{\phi(y) + \gamma(y)\} \le t]$ equals
$\prod_{y \in Y} F(t - \phi(y))$. Below we describe the
max-stability of the Gumbel distribution and its
expected value.


Lemma 1. Let $\{\gamma(y)\}_{y \in Y}$ be a collection of
independent random variables $\gamma(y)$ indexed by $y \in Y$, each
following the Gumbel distribution whose cumulative
distribution function is $F(t) = \exp(-\exp(-(t + c)))$,
where $c$ is the Euler constant. Then the random
variable $\max_{y \in Y}\{\phi(y) + \gamma(y)\}$ is distributed according to
the Gumbel distribution and its expected value is the
logarithm of the partition function:

$$\log Z = E_\gamma\Big[\max_{y \in Y}\{\phi(y) + \gamma(y)\}\Big]. \tag{4}$$


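Proof: The Gumbel cumulative distribution function
is closed under multiplication, namely

$$\prod_{y \in Y} F(t - \phi(y)) = \exp\Big(-\sum_{y \in Y} \exp(-(t - \phi(y) + c))\Big) = \exp(-\exp(-(t + c)) Z) = F(t - \log Z).$$

Therefore the random variable $\max_{y \in Y}\{\phi(y) + \gamma(y)\}$
has the Gumbel distribution shifted by $\log Z$. Because
$c$ is the Euler constant, a random variable with
cumulative distribution function $F$ has zero mean, so
the expected value of the maximum is
the logarithm of the partition function.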
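
Lemma 1 can be checked numerically on a model small enough to enumerate. The following is a minimal sketch, assuming NumPy; the function names, the random potentials, and the sample size are illustrative choices, not part of the original method description.

```python
import numpy as np

def log_partition(phi):
    """Exact log Z = log sum_y exp(phi(y)), computed with a stable log-sum-exp."""
    m = phi.max()
    return float(m + np.log(np.exp(phi - m).sum()))

def gumbel_max_estimate(phi, num_samples, rng):
    """Sample average of max_y {phi(y) + gamma(y)} with one perturbation per assignment."""
    # loc = -Euler constant gives the CDF F(t) = exp(-exp(-(t + c))), which has zero mean.
    gamma = rng.gumbel(loc=-np.euler_gamma, scale=1.0, size=(num_samples, phi.size))
    return float((phi.ravel() + gamma).max(axis=1).mean())

rng = np.random.default_rng(0)
phi = rng.normal(size=(2, 2, 2))   # arbitrary potentials over three binary variables
exact = log_partition(phi)
estimate = gumbel_max_estimate(phi, 200_000, rng)
print(exact, estimate)             # the two values should agree up to sampling noise
```

Note that the sketch draws one perturbation per joint assignment, which is exactly the exponential cost discussed next.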


In general each $y = (y_1, \dots, y_n)$ represents an
assignment to $n$ variables. In this case the lemma suggests
introducing an independent perturbation $\gamma(y)$ for each
such $n$-dimensional assignment $y \in Y$. The complexity
of evaluating the log-partition function in this manner
would be exponential in $n$. In the following we use
lower dimensional random MAP perturbations as the 
main tool for approximating and bounding the parti- 
tion function. 


4. Low Dimensional Perturbations 


In this section, we develop efficient approximations 
and bounds for the partition function based on low 
dimensional random MAP perturbations. We com- 
mence with rewriting the previous result by exploiting 
the structure of the product space. 


Theorem 1. Let $\{\gamma_i(y_i)\}_{y_i \in Y_i,\, i = 1, \dots, n}$ be a collection
of independent and identically distributed (i.i.d.)
random variables following the Gumbel distribution with
$F(t) = \exp(-\exp(-(t + c)))$, where $c$ is the Euler
constant. Define $\gamma_i = \{\gamma_i(y_i)\}_{y_i \in Y_i}$. Then

$$\log Z = E_{\gamma_1} \max_{y_1} \cdots E_{\gamma_n} \max_{y_n} \Big\{\phi(y) + \sum_{i=1}^n \gamma_i(y_i)\Big\}.$$


Proof: The result follows from applying equation (4)
iteratively. Intuitively, $Z = \sum_{y_1} \cdots \sum_{y_n} \exp(\phi(y))$
and equation (4) encodes the correspondence between
summation and expectation-maximization, namely
$\sum_{y_i} \leftrightarrow E_{\gamma_i(y_i)} \max_{y_i}$. More formally, consider the
recursion $\phi_{i-1}(y_1, \dots, y_{i-1}) = E_{\gamma_i} \max_{y_i}\{\phi_i(y_1, \dots, y_i) + \gamma_i(y_i)\}$,
where $\phi_n(y_1, \dots, y_n) = \phi(y_1, \dots, y_n)$. Equation (4)
implies that for each $i$,
$\phi_{i-1}(y_1, \dots, y_{i-1}) = \log \sum_{y_i} \exp(\phi_i(y_1, \dots, y_i))$.
The rest of the proof now follows by induction.
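
As a worked instance of the induction, consider $n = 2$. The steps below only apply equation (4) twice, first to the inner variable and then to the outer one, and use no notation beyond that of Theorem 1.

```latex
\begin{aligned}
E_{\gamma_1} \max_{y_1} E_{\gamma_2} \max_{y_2} \{\phi(y_1, y_2) + \gamma_1(y_1) + \gamma_2(y_2)\}
&= E_{\gamma_1} \max_{y_1} \Big\{ \gamma_1(y_1) + E_{\gamma_2} \max_{y_2} \{\phi(y_1, y_2) + \gamma_2(y_2)\} \Big\} \\
&= E_{\gamma_1} \max_{y_1} \Big\{ \gamma_1(y_1) + \log \sum_{y_2} \exp(\phi(y_1, y_2)) \Big\} \\
&= \log \sum_{y_1} \sum_{y_2} \exp(\phi(y_1, y_2)) = \log Z .
\end{aligned}
```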


The computational complexity of the alternating
procedure is still exponential in $n$. For example, the
inner iteration $E_{\gamma_n} \max_{y_n}\{\phi(y_1, \dots, y_n) + \gamma_n(y_n)\}$ needs
to be estimated exponentially many times, i.e., for
every $y_1, \dots, y_{n-1}$. Thus from a computational perspective
the alternating formulation in Theorem 1 is just as
inefficient as the formulation in equation (4). We show
next how Theorem 1 can be used to derive effective 
bounds and approximations. 


4.1. Upper Bounds on the Partition Function 


Theorem 1 directly provides easily computable upper 
bounds on the log-partition function. Intuitively, these 
bounds correspond to moving expectations outside the 
maximization operations, each move resulting in an 
additional bound but also reducing the computational 
effort needed for the evaluation. For example, 


$$\log Z \le E_\gamma\Big[\max_y \Big\{\phi(y) + \sum_{i=1}^n \gamma_i(y_i)\Big\}\Big] \tag{5}$$


follows immediately from moving all the expectations 
in front. In this case the bound is a simple average of 
MAP values corresponding to models with only single
node perturbations $\{\gamma_i(y_i)\}_{y_i \in Y_i,\, i = 1, \dots, n}$. If the
maximization over $\phi(y)$ is feasible (e.g., due to
super-modularity), it will typically be feasible after such
perturbations as well. The upper bound can thus be
evaluated efficiently as a sample average. We generalize
this basic result further below. 
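
The sample-average form of the bound in equation (5) can be sketched as follows, assuming NumPy. Brute-force maximization over a small table stands in for an efficient MAP solver such as graph-cuts; the function names and the test potentials are illustrative assumptions.

```python
import numpy as np

def log_partition(phi):
    """Exact log Z by enumeration, used only to compare against the bound."""
    m = phi.max()
    return float(m + np.log(np.exp(phi - m).sum()))

def perturbed_map_upper_bound(phi, num_samples, rng):
    """Sample average of max_y {phi(y) + sum_i gamma_i(y_i)}, cf. equation (5)."""
    n = phi.ndim
    values = np.empty(num_samples)
    for s in range(num_samples):
        perturbed = phi.copy()
        for i in range(n):
            # One zero-mean Gumbel draw per state of variable i, broadcast along axis i.
            gamma_i = rng.gumbel(loc=-np.euler_gamma, size=phi.shape[i])
            shape = [1] * n
            shape[i] = phi.shape[i]
            perturbed = perturbed + gamma_i.reshape(shape)
        # Brute-force MAP value of the perturbed model (placeholder for a real solver).
        values[s] = perturbed.max()
    return float(values.mean())

rng = np.random.default_rng(0)
phi = 2.0 * rng.normal(size=(2, 2, 2, 2))   # four coupled binary variables
exact = log_partition(phi)
bound = perturbed_map_upper_bound(phi, 5000, rng)
print(exact, bound)
```

Only the number of perturbation draws grows with the number of variables here; the exponential work is confined to the stand-in MAP step, which a structured solver would replace.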


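Corollary 1. Consider a family of subsets $\alpha \in A$ such
that $\cup_{\alpha \in A} \alpha = \{1, \dots, n\}$, and let $y_\alpha$ be a set of variables
$\{y_i\}_{i \in \alpha}$ restricted to the indexes in $\alpha$. Assume that
the random variables $\gamma_\alpha(y_\alpha)$ are i.i.d. according to the
Gumbel distribution, for every $\alpha, y_\alpha$. Then

$$\log Z \le E_\gamma\Big[\max_y \Big\{\phi(y) + \sum_{\alpha \in A} \gamma_\alpha(y_\alpha)\Big\}\Big].$$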

Proof: If the subsets $\alpha$ are disjoint, then $\{y_\alpha\}_{\alpha \in A}$
simply defines a partition of the variables in the model.
We can therefore use equation (5) over these grouped
variables. In the general case, $\alpha, \alpha' \in A$ may
overlap. We lift the variables $y_1, \dots, y_n$ to a larger set
$y' = \{y'_\alpha\}_{\alpha \in A}$ where an independent set of variables is
introduced for each $\alpha \in A$. We lift the potentials to
$\phi'(y')$ by including consistency constraints among the
lifted variables

$$\phi'(y') = \begin{cases} \phi(y_1, \dots, y_n) & \text{if } \forall \alpha,\, i \in \alpha : y'_{\alpha,i} = y_i \\ -\infty & \text{otherwise.} \end{cases}$$

Thus, $\log Z = \log \sum_{y'} \exp(\phi'(y'))$ since inconsistent
settings receive zero weight. Moreover,
$\max_{y'}\{\phi'(y') + \sum_\alpha \gamma_\alpha(y'_\alpha)\}$ equals
$\max_y\{\phi(y) + \sum_\alpha \gamma_\alpha(y_\alpha)\}$ for each
realization of the perturbation. This equality holds
after expectation over $\gamma$ as well. Now, given that the
perturbations are independent for each lifted coordinate,
the basic result in equation (5) guarantees that
$E_\gamma[\max_{y'}\{\phi'(y') + \sum_\alpha \gamma_\alpha(y'_\alpha)\}]$ upper bounds $\log Z$.
This completes the proof.


The structure of $\alpha \subset \{1, \dots, n\}$ determines the
statistical quality and algorithmic efficiency of the method.
For example, using a single set $\alpha = \{1, \dots, n\}$ the
upper bound turns out to be the exact characterization in
equation (4), but it requires exponentially many
independent random variables.


4.2. Approximating the Partition Function 


We can also use Theorem 1 to derive sampling based 
approximation schemes for the partition function that 
also utilize efficient MAP solvers. As a simple step, 
we could just replace the expectations in Theorem 1 
with sampled estimates. The main subtlety lies in how 
these samples are reused as part of the outer loop max- 
imization steps. 


Let's begin by considering the $n$-th dimension alone.
For any given setting of $y_1, \dots, y_{n-1}$, we toss
random values $\gamma_n(y_n)$, for every $y_n \in Y_n$, to estimate
$\log \sum_{y_n} \exp(\phi(y_1, \dots, y_n))$. Since the operation has to
be repeated for each different setting of $y_1, \dots, y_{n-1}$,
we need $m \cdot |Y|$ random variables $\gamma_{n,j}(y_1, \dots, y_{n-1}, y_n)$,
for every $j = 1, \dots, m$, to ensure that

$$\frac{1}{m} \sum_{j=1}^m \max_{y_n} \{\phi(y_1, \dots, y_n) + \gamma_{n,j}(y_1, \dots, y_n)\}$$

approximates $\log \sum_{y_n} \exp(\phi(y_1, \dots, y_n))$ across all
$y_1, \dots, y_{n-1}$. We can rewrite this expression in terms
of a single maximization problem over an extended set
of variables. Specifically, we introduce copies $y_{n,j}$ for
each sample index $j = 1, \dots, m$, and pull the
maximizations outside the average:

$$\max_{y_{n,1}, \dots, y_{n,m}} \frac{1}{m} \sum_{j=1}^m \big(\phi(y_1, \dots, y_{n-1}, y_{n,j}) + \gamma_{n,j}(y_1, \dots, y_{n-1}, y_{n,j})\big).$$


This result is an approximation to the last expectation- 
maximization step in Theorem 1. We can now repeat 
the procedure for dimension $n - 1$. Formally, we are
using the fact that the partition function is self-reducible,
i.e., $\log Z = \log \sum_{y_1, \dots, y_{n-1}} \exp(\log \sum_{y_n} \exp(\phi(y)))$.
After repeating this step for all the dimensions, we 
write the resulting approximation as a single but very 
large MAP program: 


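$$\log Z \approx \max_{y_{1,j_1}, \dots, y_{n,j_n}} \frac{1}{m^n} \sum_{j_1=1}^m \cdots \sum_{j_n=1}^m \Big( \phi(y_{1,j_1}, \dots, y_{n,j_n}) + \sum_{i=1}^n \gamma_{i,j_i}(y_{1,j_1}, \dots, y_{i,j_i}) \Big).$$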


In this expression, the maximization is over an inflated 
set of variables where there are $m$ copies of each
variable, e.g., $y_i$ is expanded into $\{y_{i,j_i}\}_{j_i = 1, \dots, m}$. The
problem is that this program uses exponentially many
independent random variables $\gamma_{i,j_i}(y_{1,j_1}, \dots, y_{i,j_i})$ and
thus suffers from the same computational problem as 
equation (4). However, due to averaging, we expect 
that most MAP perturbations are now concentrated 
around the logarithm of the partition function. 


For computational efficiency we suggest reusing the 
perturbations, collapsing independent random variables
$\gamma_{i,j_i}(y_{1,j_1}, \dots, y_{i,j_i})$ to far fewer random variables
$\gamma_{i,j_i}(y_{i,j_i})$. This results in the following approximation
for the log-partition function: 


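$$\max_{y_{1,j_1}, \dots, y_{n,j_n}} \Big\{ \frac{1}{m^n} \sum_{j_1, \dots, j_n} \phi(y_{1,j_1}, \dots, y_{n,j_n}) + \frac{1}{m} \sum_{i, j_i} \gamma_{i,j_i}(y_{i,j_i}) \Big\}.$$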


This approximation is effective whenever the dominant
configurations of variables $\{y_{i,j_i}\}$ were already
correlated (few modes) so that the randomness in
$\gamma_{i,j_i}(y_{1,j_1}, \dots, y_{i,j_i})$ could be compressed with little loss
in accuracy. This behavior is typical in models with 
strong couplings. 
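
The collapsed program can be illustrated by brute force on a tiny model. The sketch below assumes NumPy and reads the objective as the average of the potential over all $m^n$ ways of choosing one copy per variable, plus the perturbations $\gamma_{i,j}(y_{i,j})$ averaged over the $m$ copies of each variable. The function name, the array layout, and the exhaustive search are assumptions made for illustration; a real implementation would pass the inflated model to a MAP solver.

```python
import itertools
import numpy as np

def collapsed_map_approximation(phi, m, rng):
    """Brute-force the inflated MAP program with collapsed perturbations.

    phi : array of shape (k,) * n holding the potentials (all variables share k states).
    m   : number of copies introduced for each variable.
    """
    n, k = phi.ndim, phi.shape[0]
    # gamma[i, j, s]: zero-mean Gumbel perturbation for copy j of variable i in state s.
    gamma = rng.gumbel(loc=-np.euler_gamma, size=(n, m, k))
    rows = np.arange(n)[:, None]
    cols = np.arange(m)[None, :]
    best = -np.inf
    # Enumerate every joint state of the n * m copies (only viable for tiny models).
    for flat in itertools.product(range(k), repeat=n * m):
        copies = np.array(flat).reshape(n, m)
        # Average of phi over all m**n ways of picking one copy per variable.
        avg_phi = phi[np.ix_(*copies)].mean()
        # Perturbations summed over variables and averaged over the m copies.
        avg_gamma = gamma[rows, cols, copies].sum() / m
        best = max(best, float(avg_phi + avg_gamma))
    return best

rng = np.random.default_rng(0)
phi = 3.0 * np.array([[1.0, -1.0], [-1.0, 1.0]])   # two strongly coupled binary variables
exact = float(np.logaddexp.reduce(phi.ravel()))
approx = collapsed_map_approximation(phi, 3, rng)
print(exact, approx)
```

With $m = 1$ the program reduces to a single perturbed MAP value with single-variable perturbations, which gives a convenient sanity check.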


4.3. Lower Bounds on the Partition Function 


For completeness, we also provide a lower bound on 
the partition function based on randomized MAP com- 
putations. Unlike the upper bound discussed earlier, 
however, the lower bound does not directly exploit 
Theorem 1. 


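Theorem 2. Consider any collection of subsets $\alpha \subset \{1, \dots, n\}$
and let $\{\gamma_\alpha(y_\alpha)\}$ be independent random
variables for each setting of $\alpha, y_\alpha$. Let
$K_{\alpha, y_\alpha}(\lambda) = \log E[\exp(\lambda \gamma_\alpha(y_\alpha))]$. Then

$$\log Z \ge \sup_{\lambda > 0} \Big\{ \log E_\gamma\Big[\exp\Big(\max_y \Big\{\phi(y) + \lambda \sum_\alpha \gamma_\alpha(y_\alpha)\Big\}\Big)\Big] - \max_y \sum_\alpha K_{\alpha, y_\alpha}(\lambda) \Big\}.$$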


The proof appears in the supplementary material. The 
lower bound is somewhat weaker than the upper bound 
in Corollary 1. For example, unlike the upper bound, 
in case of a single $\alpha = \{1, \dots, n\}$, the lower bound is
not tight for every $\phi(y)$. The parameter $\lambda$ governs
the tradeoff between the perturbed MAP value and
the cumulant generating function $K_{\alpha, y_\alpha}(\lambda)$. For $\lambda = 0$
the cumulant generating function is zero and the lower
bound reduces to a trivial bound based on the MAP
value. However, we cannot choose $\lambda$ arbitrarily large
since $K_{\alpha, y_\alpha}(\lambda)$ can be unbounded depending on the
distributions used for perturbations. 
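
To illustrate why $\lambda$ cannot be taken arbitrarily large, consider the zero-mean Gumbel perturbations of Lemma 1 as one possible choice; Theorem 2 itself does not require this distribution. Using the standard moment generating function of the Gumbel distribution, with $\Gamma$ the gamma function and $c$ the Euler constant, the cumulant generating function is

```latex
K(\lambda) = \log E[\exp(\lambda \gamma)] = \log \Gamma(1 - \lambda) - c\,\lambda, \qquad 0 \le \lambda < 1 .
```

It vanishes at $\lambda = 0$ and diverges as $\lambda \to 1$, so under this choice the supremum is effectively restricted to $\lambda < 1$.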


5. Conditional Random Fields 


In a supervised learning problem, we assume training
data $S$ of objects $x \in X$ and labels $y \in Y$, such
as images and their segmentations. Given a feature
vector $\Phi(x, y) < \infty$ for each object $x \in X$ and
label $y \in Y$, the learning task is to estimate
parameters $\theta$ that maximize the log likelihood of the data
under the conditional random field model
$p(y \mid x; \theta) = \exp(\theta^\top \Phi(x, y)) / Z_x(\theta)$. This task can be equivalently
stated as a loss minimization problem

$$\min_\theta \sum_{(x, y) \in S} \big(\log Z_x(\theta) - \theta^\top \Phi(x, y)\big).$$


The formulation emphasizes the computational cost 
of using conditional random fields due to the partition 
function. In the following we make use of a surrogate 
partition function arising from random MAP pertur- 
bations. 


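Consider a family of subsets $\alpha \subset \{1, \dots, n\}$ that cover
$\{1, \dots, n\}$, and let $\{\gamma_\alpha(y_\alpha)\}_{\alpha, y_\alpha}$ be i.i.d. random
variables with continuous densities. We use the following
surrogate criterion for estimating $\theta$:

$$J(\theta) = \sum_{(x, y) \in S} \Big( E_\gamma\Big[\max_{\hat y} \Big\{\theta^\top \Phi(x, \hat y) + \sum_\alpha \gamma_\alpha(\hat y_\alpha)\Big\}\Big] - \theta^\top \Phi(x, y) \Big).$$
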
The theorem below establishes some basic properties 
of this criterion. 


Figure 1. Comparing random MAP perturbations with tree
re-weighted and belief propagation estimations for the
log-partition on a 10 × 10 spin glass model with weak and strong
local field potentials. The plots describe the absolute
estimation error, averaged over 100 trials.


Theorem 3. Under the above assumptions, $J(\theta)$ is
convex and smooth, and its gradient enforces the
moment matching constraints

$$\frac{\partial J(\theta)}{\partial \theta} = \sum_{(x, y) \in S} \Big( \sum_{y'} \tilde p_x(y') \Phi(x, y') - \Phi(x, y) \Big)$$

where

$$\tilde p_x(y') \stackrel{\text{def}}{=} P\Big[y' \in \arg\max_{\hat y_1, \dots, \hat y_n} \Big\{\theta^\top \Phi(x, \hat y) + \sum_\alpha \gamma_\alpha(\hat y_\alpha)\Big\}\Big].$$

In particular, if $\gamma_\alpha(y_\alpha)$ have the Gumbel distribution,
$J(\theta)$ upper bounds the conditional random field loss.


Proof: The loss function is convex in $\theta$, cf.
(Rockafellar, 1974) Theorem 3. Corollary 1 implies that
the surrogate loss upper bounds the partition-loss. To 
compute the gradient we use Theorem 23 in (Rockafel- 
lar, 1974) to differentiate under the integral. The sub- 
gradient of the max-function is the indicator function 
over the maximum argument, cf. Proposition 4.5.1 in 
(Bertsekas et al., 2003). Since the expectation over the 
indicator function results in a probability distribution, 
the surrogate loss is smooth and the gradient takes the 
above form. 
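
The moment matching gradient can be estimated by sampling perturbed MAP assignments. The following is a minimal sketch for a single training example with two variables, assuming NumPy; the function name, the feature layout, and the brute-force arg max are hypothetical choices for illustration, with the arg max standing in for a MAP solver.

```python
import itertools
import numpy as np

def surrogate_gradient(theta, features, label_index, num_samples, rng):
    """Monte Carlo estimate of sum_y' p~(y') Phi(x, y') - Phi(x, y) for one example.

    features    : array (k, k, d) with features[y1, y2] = Phi(x, (y1, y2)).
    label_index : tuple (y1, y2) giving the observed label.
    """
    k = features.shape[0]
    scores = features @ theta                      # theta^T Phi(x, y) for every y, shape (k, k)
    expected = np.zeros(features.shape[-1])
    for _ in range(num_samples):
        # Single-variable perturbations gamma_1(y1) + gamma_2(y2).
        g1 = rng.gumbel(loc=-np.euler_gamma, size=k)
        g2 = rng.gumbel(loc=-np.euler_gamma, size=k)
        perturbed = scores + g1[:, None] + g2[None, :]
        # Brute-force arg max of the perturbed model (placeholder for a MAP solver).
        y_hat = np.unravel_index(perturbed.argmax(), perturbed.shape)
        expected += features[y_hat]
    return expected / num_samples - features[label_index]

rng = np.random.default_rng(0)
k = 2
# Indicator features: the first k entries mark the state of y1, the last k mark y2.
features = np.zeros((k, k, 2 * k))
for y1, y2 in itertools.product(range(k), repeat=2):
    features[y1, y2, y1] = 1.0
    features[y1, y2, k + y2] = 1.0
theta = np.array([0.5, -0.5, 1.0, 0.0])
grad = surrogate_gradient(theta, features, (0, 1), 20_000, rng)
print(grad)
```

Because the toy features decouple the two variables, the sampled arg max follows the Gibbs distribution exactly in this example, so the estimate can be compared against softmax marginals.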


The structure of $\alpha \subset \{1, \dots, n\}$ determines the
statistical quality and algorithmic efficiency of the
approximation. For example, if we use a single set $\alpha = \{1, \dots, n\}$,
this formulation is an exact characterization of the 
conditional random fields, and its gradient describes 
the standard moment matching condition. 


6. Empirical Evaluation 


We evaluated our approach on spin glass models

$$\phi(y_1, \dots, y_n) = \sum_{i \in V} \phi_i(y_i) + \sum_{(i, j) \in E} \phi_{i,j}(y_i, y_j),$$

where $y_i \in \{-1, 1\}$. Each spin has a local field
parameter $\phi_i(y_i) = \theta_i y_i$ and interacts in a grid shaped
graphical structure with couplings $\phi_{i,j}(y_i, y_j) = \theta_{i,j} y_i y_j$.
Whenever the coupling parameters are positive the 
model is called attractive as adjacent variables give 
higher values to positively correlated configurations. 
We used low dimensional random perturbations $\gamma_i(y_i)$
since such perturbations do not affect the complexity 
of the MAP solver. 
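
The spin glass setup can be sketched on a grid small enough to enumerate, assuming NumPy. A 3 × 3 grid replaces the 10 × 10 grid so that the exact log-partition function is available by brute force; the helper names and the coupling range (c = 2) are hypothetical choices for this sketch.

```python
import itertools
import numpy as np

def grid_edges(rows, cols):
    """Edges of a rows x cols grid whose nodes are numbered row by row."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                edges.append((i, i + 1))
            if r + 1 < rows:
                edges.append((i, i + cols))
    return edges

def spin_glass_table(theta_node, theta_edge, edges):
    """All configurations in {-1, 1}^n together with their potentials phi(y)."""
    n = len(theta_node)
    configs = np.array(list(itertools.product([-1, 1], repeat=n)))   # shape (2**n, n)
    phi = configs @ theta_node
    for (i, j), t in zip(edges, theta_edge):
        phi = phi + t * configs[:, i] * configs[:, j]
    return configs, phi

rng = np.random.default_rng(0)
rows = cols = 3
n = rows * cols
edges = grid_edges(rows, cols)
theta_node = rng.uniform(-1.0, 1.0, size=n)            # strong local fields, f = 1
theta_edge = rng.uniform(0.0, 2.0, size=len(edges))    # attractive couplings, c = 2 (assumed)
configs, phi = spin_glass_table(theta_node, theta_edge, edges)
log_z = float(np.logaddexp.reduce(phi))

# Upper bound from single-node perturbations gamma_i(y_i); enumeration replaces graph-cuts.
state = (configs + 1) // 2                              # map spins -1/+1 to columns 0/1
maps = []
for _ in range(2000):
    gamma = rng.gumbel(loc=-np.euler_gamma, size=(n, 2))
    perturbation = gamma[np.arange(n), state].sum(axis=1)
    maps.append((phi + perturbation).max())
upper = float(np.mean(maps))
print(log_z, upper)
```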


Evaluating the partition function is challenging when 
considering strong local field potentials and coupling 
strengths. The corresponding energy landscape is 
ragged, and characterized by a relatively small set of 
dominating configurations. The energy and probabil- 
ity landscapes are presented in the supplementary ma- 
terial. In the following, we show that the random MAP 
perturbations approach performs better than previous
approaches in this setting. 


We evaluated the performance of our method on a 10 × 10
spin glass. The local field parameters $\theta_i$ were drawn
uniformly at random from $[-f, f]$, where $f \in \{0.1, 1\}$
to reflect weak and strong potentials. The parameters
$\theta_{i,j}$ were drawn uniformly from $[0, c]$ or $[-c, c]$ to
obtain attractive or mixed coupling potentials. The
following algorithms were used to estimate the parti- 
tion function: 


• The random MAP perturbation approximation,
described in Section 4.2, was executed by inflating
the graphical model to a 1000 × 1000 grid. When
dealing with attractive potentials this was
efficiently evaluated using graph-cuts (Boykov et al.,
2001). This approximation method could not be
applied to mixed potentials.


• The random MAP perturbation upper bound,
described in Section 4.1, with perturbations $\gamma_i(y_i)$.
The expectation was computed using 100 random
MAP perturbations, although very similar results
were attained after only 10 perturbations. The
MAP was computed in the attractive case using
graph-cuts and in the mixed case using MPLP
(Sontag et al., 2008).


• The sum-product form of tree re-weighted belief
propagation with uniform distribution over the
spanning trees (Wainwright et al., 2005).


• Sum-product belief propagation. Whenever the
algorithm did not converge we extracted its
partition function approximation from its messages.


We computed the absolute error in estimating the log- 
arithm of the partition function, averaged over 100 
spin glass models, see Fig. 1. One can see that the 
random MAP perturbation approximation works very 
well and gives a very accurate estimation using a sin- 
gle MAP value, when considering strong coupling po- 
tentials. Also, the random MAP perturbation upper 
bound is better than the tree re-weighted upper bound 
in the strong signal domain. In the attractive case, the
random MAP perturbation approximation and upper bound
used the graph-cuts algorithm and were therefore
considerably faster than the belief propagation variants.
The sum-product belief propagation performs well on
average, but the plots reveal its variance. This
demonstrates the typical behavior of belief propagation:
it minimizes the non-convex Bethe free energy, and thus
works well on some instances but fails to converge or
attains bad local minima on others.


[Figure 2 panels, left to right: model; train / test; ours; SVM-struct]

Figure 2. From left to right: (a) Binary 100 × 70 image.
(b) A representative image in the training set and the test
set, where 10% of the pixels are randomly flipped. (c) A
de-noised test image with our method: The test set error
is 1.8%. (d) A de-noised test image with SVM-struct: The
pixel based error is 8.2%.


We note that the lower bounds of random MAP per- 
turbations, described in Section 4.3, are qualitatively 
the same as the MAP value, and are outperformed by 
standard techniques such as the mean-field. We omit- 
ted these experiments. 


We also demonstrated the effectiveness of random 
MAP perturbations in supervised learning. The training
data was composed of ten 100 × 70 binary images,
consisting of a man's silhouette and random binary
noise, described in Fig. 2. Each image $x$ is described
by binary local features $\phi_i(x, y_i)$, which encode whether the
$i$-th pixel is foreground or background, and pairwise
features $\phi_{i,j}(y_i, y_j)$, which encourage adjacent pixels
to have the same label. The goal is to estimate the
parameters $\theta_i, \theta_{i,j}$ to de-noise the images. Since we are
considering 100 × 70 images, there are about 20,000
parameters to estimate. Conditional random fields
cannot be evaluated on this problem, as the partition
function cannot be computed for general graphs with many
cycles. However, the MAP can be efficiently estimated
using MPLP, thus we applied our approximate con- 
ditional random fields described in Section 5. Using 
the estimated parameters, the pixel based error on the 
test set was 1.8%. As mentioned before, this approach 
cannot be compared with conditional random fields. 
However, without perturbations this program relates 
to structured-SVM (cf. (Tsochantaridis et al., 2006)) 
and can be evaluated through MAP solvers. For this 
case, the pixel based error on the test set was 8.2%. 


7. Related Work 


Throughout this work, we estimate the partition func- 
tion while computing the max-statistics of collections 
of random variables. We refer the interested reader 
to (Kotz & Nadarajah, 2000) for a more comprehensive
introduction to extreme value statistics.


To the best of our knowledge the expected value of
random MAP perturbations over discrete product
spaces has not been extensively studied. Talagrand
((Talagrand, 1994), Proposition 4.3) was the first to
use random MAP perturbations in discrete settings.
However, their approach differs from ours in that their
goal was to upper bound the size of $\mathrm{dom}(\phi) \subset \{0, 1\}^n$
using random variables with the Laplace distribution.
The proof technique is based on a compression argument,
and does not extend to the partition function.
Restricting to $\phi(y) \in \{-\infty, 0\}$, Corollary 1 presents
an alternative technique with weighted assignments.
Another upper bound for $\mathrm{dom}(\phi) \subset \{0, 1\}^n$ was
described in ((Barvinok & Samorodnitsky, 2007), Theorem 3.1).
Their approach used the induction method
of (Talagrand, 1995) to prove an upper bound using 
the logistic distribution. Our Corollary 1 provides 
an alternative technique for this result. They also 
extend their upper bound to functions of the form 
Lane Deak Ii. y,=1 G Where q are rational numbers. 


In this work we also consider parameter estimation 
using approximate conditional random fields. While 
computing the gradient we obtained the known re- 
sult that the Gibbs distribution can be described by 
the maximal argument of random MAP perturba- 
tion. This result is widely used in economics, pro- 
viding a probabilistic interpretation for choices made 
by people among a finite set of alternatives. Specifically,
the probability of choosing an alternative
$P[\hat y \in \arg\max_y \{\phi(y) + \gamma(y)\}]$ follows the Gibbs distribution
whenever $\gamma(y)$ are independent and distributed according
to the Gumbel distribution (McFadden, 1974).
This approach is computationally intractable when 
dealing with discrete product spaces, as it considers 
n-dimensional independent perturbations. This moti- 
vated efficient ways to approximately sample from the 
Gibbs distribution, through a probability distribution 
of the form $P[\hat y \in \arg\max_y \{\phi(y) + \sum_\alpha \gamma_\alpha(y_\alpha)\}]$
(Papandreou & Yuille, 2011). In particular, the gradient
suggested in Theorem 3 was described in (Papandreou 
& Yuille, 2011). For a more general class of proba- 
bilistic models that exploit efficient optimization we 
refer to (Tarlow et al., 2012). Whenever the pertur- 
bations occur in the feature space, random MAP per- 
turbation models relate to PAC-Bayes generalization 
bounds (Keshet et al., 2011). Other surrogate probability
models using computational structures appear
in (Papandreou & Yuille, 2010; Kulesza & Taskar, 
2010). 


More broadly, methods for estimating the partition 
function have been the subject of extensive research over the
past decades. Gibbs sampling, Annealed Importance
Sampling and MCMC are typically used for estimating
the partition function (cf. (Koller & Friedman, 2009)
and references therein). These methods are slow when
considering ragged energy landscapes, and their mix- 
ing time is typically exponential in n. In contrast, 
perturbed MAP operations are unaffected by ragged 
energy landscapes provided that the MAP is feasible. 


Variational approaches have been extensively devel- 
oped to efficiently estimate the partition function in 
large-scale problems. These are often inner-bound 
methods where a simpler distribution is optimized as 
an approximation to the posterior in a KL-divergence 
sense. The difficulty comes from non-convexity of the 
set of feasible distributions (e.g., mean field) (Jor- 
dan et al., 1999). Variational upper bounds on the 
other hand are convex, usually derived by replacing 
the entropy term with a simpler surrogate function 
and relaxing constraints on sufficient statistics (see, 
e.g., (Wainwright et al., 2005)). 


8. Discussion 


Evaluating the partition function and computing MAP 
assignments of variables are key sub-problems in ma- 
chine learning. While it is well-known that the ability 
to compute the partition function also leads to a vi- 
able MAP algorithm, the reverse is not. We showed 
here that a randomly perturbed MAP solver can ap- 
proximate and bound the partition function. The re- 
sult enables us to take advantage of efficient MAP 
solvers. Moreover, we demonstrated the effectiveness 
of our approach in the "high-signal high-coupling"
regime which dominates machine learning applications 
and is traditionally hard for current methods. We 
also applied our approximation to conditional random 
fields, describing the objective function of the moment
matching algorithm of (Papandreou & Yuille, 2011). 


The bounds we presented hold in expectation. In
practice we compute the empirical mean, and standard 
techniques in measure concentration, e.g. Chebyshev’s 
inequality, describe how the sampled mean relates to 
the expected value. 
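
As one concrete instance of this remark, Chebyshev's inequality gives the following for $m$ independent perturbed MAP values $V_1, \dots, V_m$ with common mean $E[V]$ and variance $\sigma^2$. The symbols $V_j$, $\sigma^2$, and $\epsilon$ are notation introduced for this sketch, and a finite variance is assumed.

```latex
P\Big[\,\Big|\frac{1}{m} \sum_{j=1}^m V_j - E[V]\Big| \ge \epsilon\,\Big] \;\le\; \frac{\sigma^2}{m\,\epsilon^2} .
```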


The results here can be taken in a number of different 
directions. The surrogate probability model, emerging 
from Theorem 3, is based on the maximal argument of 
perturbed MAP program. This surrogate probability 
model directly measures the robustness of prediction. 
Our approach also suggests a new entropy approxi- 
mation, which can be derived as the conjugate-dual 
of the surrogate log-partition function. This entropy 
approximation is used with a valid set of probability 
distributions induced by perturbations. This is in con- 
trast to current approximations, e.g. based on Bethe 
entropies, defined over the local marginal polytope. 

